Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


    -1 fastq_pure/ERR1823587_pure_R1-50.fastq.gz \
    -2 fastq_pure/ERR1823587_pure_R2-50.fastq.gz \
    --report-file centrifuge_out/ERR1823587-report.txt \
    -S centrifuge_out/ERR1823587-results.txt

centrifuge -x p+h+v \
    -1 fastq_pure/ERR1823601_pure_R1-50.fastq.gz \
    -2 fastq_pure/ERR1823601_pure_R2-50.fastq.gz \
    --report-file centrifuge_out/ERR1823601-report.txt \
    -S centrifuge_out/ERR1823601-results.txt

centrifuge -x p+h+v \
    -1 fastq_pure/ERR1823608_pure_R1-50.fastq.gz \
    -2 fastq_pure/ERR1823608_pure_R2-50.fastq.gz \
    --report-file centrifuge_out/ERR1823608-report.txt \
    -S centrifuge_out/ERR1823608-results.txt
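Because the three commands differ only in the sample accession, they could also be written as a loop. The following is a minimal sketch rather than part of the original workflow: the function name run_centrifuge_batch and the CENTRIFUGE override variable are hypothetical, and the file-naming pattern is assumed to match the commands above.

```shell
# Hypothetical helper (not from the book): run Centrifuge once per sample ID.
# Setting CENTRIFUGE=echo prints each command instead of running it (dry run).
run_centrifuge_batch() {
    # Create the output directory if it does not exist yet.
    mkdir -p centrifuge_out
    for id in "$@"; do
        "${CENTRIFUGE:-centrifuge}" -x p+h+v \
            -1 "fastq_pure/${id}_pure_R1-50.fastq.gz" \
            -2 "fastq_pure/${id}_pure_R2-50.fastq.gz" \
            --report-file "centrifuge_out/${id}-report.txt" \
            -S "centrifuge_out/${id}-results.txt"
    done
}
```

It might be called as `run_centrifuge_batch ERR1823587 ERR1823601 ERR1823608`. Prefixing the call with `CENTRIFUGE=echo` gives a quick way to inspect the generated commands before committing to a long run.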

The results are saved in the “*-results.txt” files. Each read classified by Centrifuge produces one line of output with eight tab-delimited fields: (1) the read ID (from the FASTQ file); (2) the sequence ID (from the database sequence); (3) the taxonomic ID of the database sequence; (4) the classification score (weighted sum of hits); (5) the score of the next-best classification; (6) the approximate number of base pairs of the read that match the database sequence (the hit length); (7) the length of the read or the combined length of the mate pairs (the query length); and (8) the number of classifications for this read.
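The column layout described above lends itself to quick command-line summaries. The sketch below counts uniquely classified reads per taxonomic ID. It assumes that the first line of the results file is a header row, that column 3 holds the taxonomic ID, and that column 8 holds the number of classifications; the function name count_unique_by_taxid is hypothetical.

```shell
# Hypothetical helper (not from the book): tally reads per taxonomic ID.
# Assumes line 1 is a header, column 3 = taxonomic ID, and
# column 8 = number of classifications for the read.
count_unique_by_taxid() {
    # Keep only reads with exactly one classification, then count per taxID.
    awk -F '\t' 'NR > 1 && $8 == 1 { n[$3]++ }
                 END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" |
        # Sort by read count (descending), then by taxonomic ID.
        sort -t "$(printf '\t')" -k2,2nr -k1,1
}
```

For example, `count_unique_by_taxid centrifuge_out/ERR1823587-results.txt` would print one taxonomic ID and its read count per line. Restricting the tally to reads with a single classification is a design choice made here for simplicity; reads assigned to several taxa would need a separate treatment.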

The “*-report.txt” files contain summaries of the identified taxa and their abundances. Each line in the file consists of seven tab-delimited fields: the name of a genome,

FIGURE 8.2 Partial Centrifuge report for the healthy sample.